% Built off of material by Chris Krogslund
For more info, see The Guardian Datablog Olympic medal winners: every one since 1896 as open data
First, we’ll use dplyr and tidyr to count the medals of each type won, for each country and each year.
head(medals)
## City Sport Discipline Athlete NOC Gender
## 1 Athens Aquatics Swimming HAJOS, Alfred HUN Men
## 2 Athens Aquatics Swimming HERSCHMANN, Otto AUT Men
## 3 Athens Aquatics Swimming DRIVAS, Dimitrios GRE Men
## 4 Athens Aquatics Swimming MALOKINIS, Ioannis GRE Men
## 5 Athens Aquatics Swimming CHASAPIS, Spiridon GRE Men
## 6 Athens Aquatics Swimming CHOROPHAS, Efstathios GRE Men
## Event Event_gender Medal Year Country
## 1 100m freestyle M Gold 1896 Hungary
## 2 100m freestyle M Silver 1896 Austria
## 3 100m freestyle for sailors M Bronze 1896 Greece
## 4 100m freestyle for sailors M Gold 1896 Greece
## 5 100m freestyle for sailors M Silver 1896 Greece
## 6 1200m freestyle M Bronze 1896 Greece
# dplyr and tidyr refresher
medal_counts <- medals %>% group_by(Medal, Year, Country) %>% summarise(count = n())
head(medal_counts)
## Source: local data frame [6 x 4]
## Groups: Medal, Year [1]
##
## Medal Year Country count
## <chr> <int> <chr> <int>
## 1 Bronze 1896 Austria 2
## 2 Bronze 1896 Denmark 3
## 3 Bronze 1896 France 2
## 4 Bronze 1896 Germany 2
## 5 Bronze 1896 Greece 22
## 6 Bronze 1896 Hungary 3
This table is in tidy format. Wide (untidy) format can be useful for plotting in base plot (more on this later)
medal_counts_wide <- medal_counts %>% spread(key = Medal, value = count) %>%
ungroup() %>%
mutate(Bronze = ifelse(is.na(Bronze), 0, Bronze)) %>%
mutate(Silver = ifelse(is.na(Silver), 0, Silver)) %>%
mutate(Gold = ifelse(is.na(Gold), 0, Gold))
head(medal_counts_wide)
## # A tibble: 6 x 5
## Year Country Bronze Gold Silver
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1896 Australia 0 2 0
## 2 1896 Austria 2 2 1
## 3 1896 Denmark 3 1 2
## 4 1896 France 2 5 4
## 5 1896 Germany 2 26 5
## 6 1896 Greece 22 10 20
Finally, let’s subset the data to gold medal counts for the US, for easier plotting.
usa_gold_medals <- medal_counts %>%
filter(Medal == "Gold") %>%
filter(Country == "United States")
The general call for base plot looks something like this:
plot(x=, y=, ...)
Additional parameters can be passed in to customize the plot:
More layers can be added to the plot with additional calls to lines, points, text, etc.
plot(medal_counts_wide$Year, medal_counts_wide$Gold) # Basic
plot(usa_gold_medals$Year, usa_gold_medals$count, type = "l",
main = "USA Gold Medals",
xlab = "Year", ylab = "Count") # with updated parameters
points(x = 1984, y = usa_gold_medals$count[usa_gold_medals$Year == 1984],
col = "red", pch = 16)
These are just a few other types of plots you can make in base graphics.
boxplot(Gold~Year, data = medal_counts_wide)
hist(medal_counts_wide$Gold)
plot(density(medal_counts_wide$Gold))
barplot(usa_gold_medals$count, width = 4, names.arg = usa_gold_medals$Year, main = "USA Gold Medals")
mosaicplot(Year~Medal, medal_counts)
medal_lm <- lm(Gold ~ Bronze + Silver, data = medal_counts_wide)
plot(medal_counts_wide %>% select(-Country)) # Calls plotting method for class of the dataset ("data.frame")
plot(medal_lm, which=1:2) # Calls plotting method for class of medal_lm object ("lm"), print first two plots only
lattice is
faster (though only noticeable over many and large plots)
simpler (at first)
better at trellis graphs
able to do 3d graphs
ggplot2 is
generally more elegant
more syntactically logical (and therefore simpler, once you learn it)
better at grouping
able to interface with maps
The general call for lattice graphics looks something like this:
graph_type(formula, data=, [options])
The most common graph types: * xyplot: generic scatterplot * barchart * bwplot: boxplot * histogram * cloud: 3D scatterplot
The specifics of the formula differ for each graph type, but the general format is straightforward
y # Show the distribution of y
y~x # Show the relationship between x and y
y~x|A # Show the relationship between x and y conditional on the values of A
y~x|A*B # Show the relationship between x and y conditional on the combinations of A and B
z~y*x # Show the 3D relationship between x, y, and z
Let’s recreate some of the plots we made above using lattice. The options for lattice overlap with the options for base graphics.
xyplot(Gold~Year, data = medal_counts_wide, main = "Gold Medal Counts")
xyplot(count~Year, data = usa_gold_medals, type = "l", main = "USA Gold Medals")
bwplot(Year~Gold, data = medal_counts_wide, main = "Gold Medal Counts") # Labels are lost when lattice automatically converts Year to a factor
bwplot(Gold~Year, data = medal_counts_wide, main = "Gold Medal Counts", horizontal = FALSE) # Flip the coordinates so year increases on the x-axis
histogram(~Gold, medal_counts_wide, type="density", main = "Gold Medal Counts")
densityplot(~Gold, medal_counts_wide, main = "Gold Medal Counts")
barchart(count~Year, data = usa_gold_medals, width = 4, main = "USA Gold Medals", horizontal = FALSE)
The general call for ggplot2 graphics looks something like this:
ggplot(data=, aes(x=,y=, [options])) + geom_xxxx() + ... + ... + ...
Note that ggplot2 graphs in layers in a continuing call (hence the endless +…+…+…), which really makes the extra layer part of the call
...+geom_xxxx(data=, aes(x=,y=,[options]),[options])+...+...+...
You can see the layering effect by comparing the same graph with different colors for each layer
p <- ggplot(data=medal_counts_wide, aes(x=Year, y=Gold)) + geom_point(color="gold")
p
p + geom_point(aes(x=Year, y=Silver), color="gray") + ylab("Medals")
ggplot2 syntax is very different from base graphics and lattice. It’s built on the grammar of graphics. The basic idea is that the visualization of all data requires four items:
One or more statistics conveying information about the data (identities, means, medians, etc.)
A coordinate system that differentiates between the intersections of statistics (at most two for ggplot, three for lattice)
Geometries that differentiate between off-coordinate variation in kind
Scales that differentiate between off-coordinate variation in degree
ggplot2 allows the user to manipulate all four of these items.
ggplot(data=, aes(x=, y=, color=, linetype=, shape=, size=))
ggplot2 is optimized for showing variation on all four aesthetic types
# Differences in kind using color
pigs.data <- data[data$country %in% c("Greece", "Portugal", "Ireland", "Spain"),]
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_line(aes(color=country))
usa_all_medals <- medal_counts %>% filter(Country == "United States")
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_line(aes(color = Medal))
Note what happens when we specify the color parameter outside of the aesthetic operator. ggplot2 views these specifications as invalid graphical parameters.
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_line(color = Medal)
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomLine, : object 'Medal' not found
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_line(color = "Medal")
## Error in grDevices::col2rgb(colour, TRUE): invalid color name 'Medal'
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_line(color = "red")
usa_all_medals_untidy <- medal_counts_wide %>% filter(Country == "United States") # Medal is the variable, but it spans multiple columns
# Base graphics call
plot(usa_all_medals_untidy$Year, usa_all_medals_untidy$Gold, col = "green", type = "l")
lines(usa_all_medals_untidy$Year, usa_all_medals_untidy$Silver, col = "blue")
lines(usa_all_medals_untidy$Year, usa_all_medals_untidy$Bronze, col = "red")
legend("right", legend = c("Gold", "Silver", "Bronze"), fill = c("green", "blue", "red"))
# ggplot2 call
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_line(aes(color = Medal))
Medal as an aesthetic parameter that differentiates kinds of statistics. Base graphics treats each medal type as a layer to the plot.ggplot(usa_gold_medals, aes(x = Year, y = count)) + geom_line()
# This combines the subsetting and plotting into one step
medal_counts %>%
filter(Medal == "Gold") %>%
filter(Country == "United States") %>%
ggplot(aes(x = Year, y = count)) + geom_line()
ggplot(data = usa_gold_medals, aes(x = count)) + geom_density() # ggplot2
densityplot(~count, data = usa_gold_medals) # lattice
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() # ggplot2
xyplot(count~Year, data = usa_gold_medals) # lattice
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_line() # ggplot2
xyplot(count~Year, data = usa_gold_medals, type = "l") # lattice
# Create a dataframe of median number of gold medals by country
median_gold_medals <- medal_counts %>%
filter(Medal == "Gold") %>%
group_by(Country) %>%
summarise(med = median(count))
ggplot(data = median_gold_medals[1:15, ], aes(x = Country, y = med)) + geom_bar(stat="identity") # ggplot2
barchart(med~Country, data = median_gold_medals[1:15, ]) # lattice
# Notice that here, you must explicitly convert numeric years to factors
ggplot(data = medal_counts_wide, aes(x = factor(Year), y = Gold)) + geom_boxplot() # ggplot2
bwplot(Gold~factor(Year), data = medal_counts_wide, horizontal = FALSE) # lattice
# Subset the data to North America countries for easier viewing
northern_hem <- medal_counts_wide %>% filter(Country %in% c("United States", "Canada", "Cuba", "Mexico"))
ggplot(data = northern_hem, aes(x = Year, y = Gold)) + geom_point() + facet_wrap(~Country) # ggplot2
xyplot(Gold~Year|Country, data = northern_hem) # lattice
data(volcano) # Load volcano contour data
volcano[1:10, 1:10] # Examine volcano dataset (first 10 rows and columns)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 100 100 101 101 101 101 101 100 100 100
## [2,] 101 101 102 102 102 102 102 101 101 101
## [3,] 102 102 103 103 103 103 103 102 102 102
## [4,] 103 103 104 104 104 104 104 103 103 103
## [5,] 104 104 105 105 105 105 105 104 104 103
## [6,] 105 105 105 106 106 106 106 105 105 104
## [7,] 105 106 106 107 107 107 107 106 106 105
## [8,] 106 107 107 108 108 108 108 107 107 106
## [9,] 107 108 108 109 109 109 109 108 108 107
## [10,] 108 109 109 110 110 110 110 109 109 108
volcano3d <- melt(volcano) # Use reshape2 package to melt the data into tidy form
head(volcano3d) # Examine volcano3d dataset (head)
## Var1 Var2 value
## 1 1 1 100
## 2 2 1 101
## 3 3 1 102
## 4 4 1 103
## 5 5 1 104
## 6 6 1 105
names(volcano3d) <- c("xvar", "yvar", "zvar") # Rename volcano3d columns
ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) + geom_contour() # ggplot2
contourplot(zvar~xvar + yvar, data = volcano3d) # lattice
ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) + geom_tile(aes(fill = zvar)) # ggplot2
levelplot(zvar~xvar + yvar, data = volcano3d) # lattice
# Create a subset of the dataset containing only data for France
cloud(Gold~Bronze*Silver, data = northern_hem)
cloud(Gold~Bronze*Silver|Country, data = northern_hem)
ggplot(data=data, aes(x=year, y=outlays)) + geom_point() +
xlab(label="Voter Turnout (%)") + ylab(label="Government Outlays") +
ggtitle(label="Cool Graph") # ggplot2
xyplot(outlays~year, data=data, xlab="Year", ylab="Government Outlays", main
="Cool Graph") # lattice
ggplot(data=data, aes(x=year, y=outlays)) + geom_point() # ggplot2
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(size=3) # ggplot2
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(size=1) # ggplot2
xyplot(outlays~year, data=data) # lattice
xyplot(outlays~year, data=data, cex=2) # lattice
xyplot(outlays~year, data=data, cex=.5) # lattice
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(color=colors()[145]) # ggplot2
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(color="red") # ggplot2
xyplot(outlays~year, data=data, col=colors()[145]) #lattice
xyplot(outlays~year, data=data, col="red") #lattice
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(shape=3) # ggplot2
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(shape=15) # ggplot2
xyplot(outlays~year, data=data, pch=3) # lattice
xyplot(outlays~year, data=data, pch=15) # lattice
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(shape=3) # ggplot2
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(shape=15) # ggplot2
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(shape="w") # ggplot2
ggplot(data=data, aes(x=year, y=outlays)) + geom_point(shape="$", size=5) # ggplot2
xyplot(outlays~year, data=data, pch=3) # lattice
xyplot(outlays~year, data=data, pch=15) # lattice
xyplot(outlays~year, data=data, pch="w") # lattice
xyplot(outlays~year, data=data, pch="$", cex=2) # lattice
ggplot(data=data[data$country=="USA",], aes(x=year, y=outlays)) +
geom_line(linetype=1) # ggplot2
ggplot(data=data[data$country=="USA",], aes(x=year, y=outlays)) +
geom_line(linetype=2) # ggplot2
ggplot(data=data[data$country=="USA",], aes(x=year, y=outlays)) +
geom_line(linetype=3) # ggplot2
ggplot(data=data[data$country=="USA",], aes(x=year, y=outlays)) +
geom_line(linetype=3, size=1) # ggplot2
ggplot(data=data[data$country=="USA",], aes(x=year, y=outlays)) +
geom_line(linetype=3, size=1.5) # ggplot2
ggplot(data=data[data$country=="USA",], aes(x=year, y=outlays)) +
geom_line(linetype=3, size=2) # ggplot2
xyplot(outlays~year, data=data[data$country=="USA",], type="l", lty=1) # lattice
xyplot(outlays~year, data=data[data$country=="USA",], type="l", lty=2) # lattice
xyplot(outlays~year, data=data[data$country=="USA",], type="l", lty=3) # lattice
xyplot(outlays~year, data=data[data$country=="USA",], type="l", lty=3, lwd=2) # lattice
xyplot(outlays~year, data=data[data$country=="USA",], type="l", lty=3, lwd=3) # lattice
xyplot(outlays~year, data=data[data$country=="USA",], type="l", lty=3, lwd=4) # lattice
# Differences in kind using line types
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_line(aes(linetype=country))
# Differences in kind using point shapes
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point(aes(shape=country))
# Differences in degree using color
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point(aes(color=realgdpgr))
# Differences in degree using point size
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point(aes(size=realgdpgr))
# Multiple non-cartesian aesthetics (differences in kind using color, degree using point size)
ggplot(data=pigs.data, aes(x=year, y=outlays)) +
geom_point(aes(color=country,size=realgdpgr))
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point()
# Add linear model (lm) smoother
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point() +
geom_smooth(method="lm")
# Add local linear model (loess) smoother, span of 0.75
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point() +
geom_smooth(method="loess", span=.75)
# Add local linear model (loess) smoother, span of 0.25
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point() +
geom_smooth(method="loess", span=.25)
# Add linear model (lm) smoother, no standard error shading
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point() +
geom_smooth(method="lm", se=F)
# Add local linear model (loess) smoother, no standard error shading
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point() +
geom_smooth(method="loess", se=F)
# Add a local linear (loess) smoother for each country
ggplot(data=pigs.data, aes(x=year, y=outlays)) + geom_point(aes(color=country)) +
geom_smooth(aes(color=country))
# Add a local linear (loess) smoother for each country, no standard error shading
ggplot(data=pigs.data, aes(x=year, y=outlays)) +
geom_point(aes(color=country, size=realgdpgr)) +
geom_smooth(aes(color=country), se=F)
# Initialize gridExtra library
library(gridExtra)
# Create 3 plots to combine in a table
plot1 <- ggplot(data=pigs.data, aes(x=year, y=outlays, color=)) +
geom_line(aes(color=country))
plot2 <- ggplot(data=pigs.data, aes(x=year, y=outlays, linetype=)) +
geom_line(aes(linetype=country))
plot3 <- ggplot(data=pigs.data, aes(x=year, y=outlays, shape=)) +
geom_point(aes(shape=country))
# Call grid.arrange
grid.arrange(plot1, plot2, plot3, nrow=3, ncol=1)
Two basic image types
Every pixel of a plot contains its own separate coding; not so great if you want to resize the image
jpeg(filename="example.png", width=, height=)
plot(x,y)
dev.off()
Every element of a plot is encoded with a function that gives its coding conditional on several factors; great for resizing
pdf(filename="example.pdf", width=, height=)
plot(x,y)
dev.off()
# Assume we saved our plot is an object called example.plot
# lattice
trellis.device(device="pdf", filename="example.pdf")
print(example.plot)
dev.off()
# ggplot2
ggsave(filename="example.pdf", plot=example.plot, scale=, width=, height=) # ggplot2
You’re welcome to try out either lattice or ggplot for these questions, but in the solutions we’ll focus on the ggplot approach.
For some of these you may want to use a smaller version of the dataset, such as a random subset, subset <- air[sample(1:nrow(air), 10000, replace = FALSE), ].
Plot a histogram of the flight delays with negative delays set to zero, censoring delay times at a maximum of 60 minutes.
Plot the arrival delay against the departure delay as a scatterplot.
Clean up your scatterplot with a title and axis labels. Output it as a PDF and see if you’d be comfortable with including it in a report/paper.
Make a boxplot of the departure delay as a function of the day of week.
Create a trellis plot of departure delay boxplots, one per destination for this subset of destinations, DestSubset <- c('LAX','SEA','PHX','DEN','MSP','JFK','ATL','DFW','IAH', 'ORD'). Use a 2x5 layout of panels in the plot.
Subset the data to flights going to Chicago (ORD) and Houston (IAH). Plot arrival delay against scheduled departure time (CRSDepTime). Now plot so that flights to Chicago are in one color and those to Houston in another. Use scale_x_continuous() and scale_y_continuous() to set the x-axis limits to be in the range from 6 am to midnight and the y-axis limits to be in the range (-10, 120).